LS4003 R tutorial 2

Correlations in R

In this tutorial we’re going to use R to plot correlations, including a line of best fit, the R value and the p value.

See example below for the end point:

Make sure you’ve completed the tutorial 1 section on using R from Excel before starting here.

Install and Set-up

A refresher for how to install and set up R and RStudio.

To get set up on a University computer, follow the steps below.

  1. Type AppsAnywhere into the Windows search bar. This will open in a web browser.
  2. Search for RStudio in AppsAnywhere.
  3. Click “Launch” and wait for it to install and open.

GIF of Opening RStudio

  4. Copy and paste the following into the Console on the left, and press Enter.

setwd("O:/")

  5. Click the “More” cog and select “Go To Working Directory”

You should now be in your OneDrive. You should recognise the files and folders listed from work you have saved there in your other classes.

GIF of Opening OneDrive

Occasionally there is an issue with how OneDrive is loaded on the University computer.

If you get the error message:

Error in setwd("O:/") : cannot change working directory

Then try the following. Replace the underscores with your K number.

setwd("C:/Users/K______/")

Click the “More” cog and select “Go To Working Directory”

Then find and click on the folder:

OneDrive Kingston University

Click the “More” cog and select “Set As Working Directory”

GIF of Opening OneDrive with common error

If you don’t already have a folder for LS4003 Statistics, then you can create one by clicking “New Folder” and entering a name.

If your new folder doesn’t appear, click the refresh button (to the right of the more cog).

Then:

  1. Click into your new folder

  2. Click the “More” cog and select “Set As Working Directory”

GIF of Making a Folder

Once you’re in your folder you can create and save an R file. This is where you put your code.

  1. Click on the Green Plus icon and select “R Script”
  2. In the top bar, click “File” and then “Save”
  3. Give your file a name (e.g. “R_tutorial_1”)

When you make any changes, you can save the file by going to File -> Save.

You can also save by holding down Control and S at the same time.

GIF of Making an R File

Note

This will automatically add the “.R” extension so we know it’s an R file - R_tutorial_1.R

Warning

Make sure you can find your file in File Explorer. Always back up your work, for example by saving it in OneDrive or emailing it to yourself, so that you don’t lose your progress.

You’re now ready to run some R code!

  1. Copy and paste the following into your R file:

value <- "Hello World"
value

  2. Highlight both lines and click the “Run” icon (green arrow)

You should see a result in your console (bottom left panel) and your environment (top right panel)

You’re now ready to work through the worksheet! As you go, try and figure out what each bit of code is doing. What happens if you change something?

GIF of Running an R File

Posit Cloud is an online, cloud-based alternative. It’s a bit more limited than running on a University computer or your own computer, but the free option should be enough for this module.

Go to Posit Cloud and create a free account

Log in, then go to New Project -> New RStudio Project.

Make a new folder in the bottom right panel (by clicking the New Folder button) called “LS4003_Statistics”.

Click on this folder to enter it, and then click the More cog (bottom right panel) and select “Set as Working Directory”.

To run R on your own machine, you have to install R (the programming language) and RStudio (the development environment).

When installing, click the most appropriate option for your machine (Windows/Mac/Linux)

Install R

Install RStudio

Once you have installed both, open RStudio.

Navigate to your Documents folder in the bottom right panel. (If you can’t find it, type setwd("~/Documents") into the console on the bottom left, then click the “More” cog on the bottom right and select “Go to Working Directory”.)

Create a new folder called LS4003_Statistics by clicking the New Folder button on the right hand side.

Click on your folder (LS4003_Statistics) to enter it.

Set that as your final working directory by clicking on the “More” cog icon again and selecting “Set as Working Directory”.
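Whichever set-up you used, a quick way to check that R is pointing at the right folder is to run the two lines below in the console. This is only a sketch for checking your set-up; what it prints depends on your own machine and folder names.

```r
# Print the folder R is currently using as its working directory
current_folder <- getwd()
current_folder

# List the files R can see in that folder
list.files()
```

If the folder printed is not your LS4003 folder, repeat the “Set as Working Directory” step above.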

Import data from CSV file

The dataset we are going to use is heart.csv which you can find on the Canvas page.

This dataset contains various metrics relating to heart health.
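A minimal sketch of the import step, assuming you have downloaded heart.csv from Canvas into your working directory. The name heart_df matches the dataframe name used later in this tutorial.

```r
# Assumes heart.csv is saved in your current working directory
heart_df <- read.csv("heart.csv")

# Look at the first few rows
head(heart_df)

# Check the name and type of each column
str(heart_df)
```

If R reports that it cannot open the file, check that heart.csv appears in the Files panel for your working directory.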

heart.csv dataset

column        data
age           Age of patient in years
sex           Gender of the patient (F/M)
restbp        Resting blood pressure (mmHg)
chol          Serum cholesterol (mg/dl)
maxheartrate  Maximum heart rate achieved

Question

Which of these columns contain categorical data and which contain continuous data?

Plotting age versus resting blood pressure

We could speculate that resting blood pressure may increase with age. To plot this, we can use the ggplot2 function geom_point().
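One way this scatter graph could be written is sketched below. It assumes the ggplot2 package is installed and that heart.csv is in your working directory; the axis labels and the name age_bp_plot are choices made for this example.

```r
library(ggplot2)

# Assumes heart.csv is in your working directory
heart_df <- read.csv("heart.csv")

# Age on the x axis, resting blood pressure on the y axis, one point per patient
age_bp_plot <- ggplot(heart_df, aes(x = age, y = restbp)) +
  geom_point() +
  labs(x = "Age (years)", y = "Resting blood pressure (mmHg)")

age_bp_plot
```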

Calculating Pearson’s coefficient

We can use two R functions to calculate our R and p values using Pearson’s correlation coefficient.

Using cor()

Our first example uses cor(). This gives us the R value - how strong is the correlation?
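A short sketch of the cor() call, assuming heart.csv is in your working directory. The name r_value is just a label chosen for this example.

```r
# Assumes heart.csv is in your working directory
heart_df <- read.csv("heart.csv")

# Pearson correlation coefficient (R value) between age and resting blood pressure
r_value <- cor(heart_df$age, heart_df$restbp, method = 'pearson')
r_value
```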

Let’s break that down:

  • cor(heart_df$age, heart_df$restbp, method = 'pearson')
    • cor() : This is our function name. This is a built in function.
    • heart_df$age : This is going to our dataframe called heart_df and extracting the column of values under the column name “age”
    • heart_df$restbp : This is going to our dataframe called heart_df and extracting the column of values under the column name “restbp”
  • method = 'pearson'
    • This is a parameter, which is an option to choose how we want the function to work.
    • We can set this to either 'pearson' or 'spearman' depending on which test we want to use.

Using cor.test()

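Our second example uses cor.test(). This gives us both the R value (how strong is the correlation?) and the p value (how likely is a correlation at least this strong if there were really no correlation?).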
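A sketch of how cor.test() could be used here, assuming heart.csv is in your working directory. The name age_bp_test is a label chosen for this example; the last two lines pull the two values of interest out of the result.

```r
# Assumes heart.csv is in your working directory
heart_df <- read.csv("heart.csv")

# Run the Pearson correlation test and store the full result
age_bp_test <- cor.test(heart_df$age, heart_df$restbp, method = 'pearson')

# Extract the p value and the correlation coefficient from the result
age_bp_test$p.value
age_bp_test$estimate
```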

As you can see, that works very similarly to our first example, except for the last two lines.

cor.test() gives us a list of 9 values, but there are only two we are interested in:

  • p.value is our probability p value
  • estimate is our correlation coefficient (R value, same as above)

Annotate the R and p values onto the scatter graph

To annotate our R and p values onto a scatter graph, we can use the stat_cor function from the ggpubr package.

If you’ve not already installed it, make sure you first run:

install.packages('ggpubr')
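A minimal sketch of the annotated scatter graph, assuming ggplot2 and ggpubr are installed and heart.csv is in your working directory. The name annotated_plot is chosen for this example.

```r
library(ggplot2)
library(ggpubr)

# Assumes heart.csv is in your working directory
heart_df <- read.csv("heart.csv")

# Scatter graph with the R and p values added by stat_cor()
annotated_plot <- ggplot(heart_df, aes(x = age, y = restbp)) +
  geom_point() +
  stat_cor(method = 'pearson')

annotated_plot
```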

The only parameter we used here for stat_cor was method = 'pearson' so that it would calculate and plot a Pearson’s correlation.

Question

Do the R and p values annotated on this graph match the results from cor.test()?

Moving the R and P values

We can use parameters label.x and label.y to add co-ordinates for where we want the R and p value annotations to go.

Try it below:
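A sketch of moving the annotation, assuming ggplot2 and ggpubr are installed and heart.csv is in your working directory. The co-ordinates 30 and 190 are example values only, not taken from the dataset, so change them to suit your own graph.

```r
library(ggplot2)
library(ggpubr)

# Assumes heart.csv is in your working directory
heart_df <- read.csv("heart.csv")

moved_label_plot <- ggplot(heart_df, aes(x = age, y = restbp)) +
  geom_point() +
  # label.x and label.y use the units of each axis (years and mmHg);
  # 30 and 190 are example values, so adjust them until the text sits where you want
  stat_cor(method = 'pearson', label.x = 30, label.y = 190)

moved_label_plot
```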

Tip

If you’d like the R and p values to each be on their own line, we can add label.sep = '\n', which separates them with a newline character (\n).

What happens if you add label.sep = 'HELLO'?

Add a regression line

We can also use the function geom_smooth to fit a linear regression model (lm) and plot it on our graph.

We use formula = y ~ x to define y as the outcome variable and x as the predictor: here, age (x axis, predictor) is used to predict resting blood pressure (y axis, predicted outcome).
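A sketch of adding the regression line, assuming ggplot2 is installed and heart.csv is in your working directory. The name regression_plot is chosen for this example.

```r
library(ggplot2)

# Assumes heart.csv is in your working directory
heart_df <- read.csv("heart.csv")

regression_plot <- ggplot(heart_df, aes(x = age, y = restbp)) +
  geom_point() +
  # method = 'lm' fits a straight line; formula = y ~ x predicts y from x
  geom_smooth(method = 'lm', formula = y ~ x)

regression_plot
```

By default geom_smooth also draws a shaded band around the line showing the 95% confidence interval. Adding se = FALSE inside geom_smooth() hides it.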

Separate by categorical group (sex)

If we add a colour based on a categorical variable, ggplot will then calculate the regression separately for each group.

See below, where we have assigned each point a colour based on sex:
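A sketch of splitting the graph by sex, assuming ggplot2 is installed and heart.csv is in your working directory. The name grouped_plot is chosen for this example.

```r
library(ggplot2)

# Assumes heart.csv is in your working directory
heart_df <- read.csv("heart.csv")

# color = sex inside aes() colours the points by group,
# and geom_smooth then fits one regression line per group
grouped_plot <- ggplot(heart_df, aes(x = age, y = restbp, color = sex)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x)

grouped_plot
```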

Correlograms

Our dataset contains more than just age and blood pressure. It would be useful to see if there are any correlations between, for example, blood pressure and maximum heart rate, or cholesterol and age.

We can also use R to run an all-against-all correlation analysis, which gives a quick overview of which variables are correlated without repeating all of the steps above for every pair.

This uses the corrplot library. You might need to run install.packages('corrplot') before you can run this code.
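A sketch of the correlogram code, assuming corrplot is installed, heart.csv is in your working directory, and sex is the second column of the file.

```r
library(corrplot)

# Assumes heart.csv is in your working directory
heart_df <- read.csv("heart.csv")

# Drop column 2 (sex) because cor() only works on numerical columns
heart_df_only_numerical <- heart_df[,-2]

# Calculate the correlation between every pair of columns
heart_corrplot_matrix <- cor(heart_df_only_numerical)

# Plot the correlations as a correlogram
corrplot(heart_corrplot_matrix)
```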

Let’s break that down:

  • library(corrplot) is loading the corrplot package, which has the functions we need

  • heart_df_only_numerical <- heart_df[,-2]

    • The sex column in this dataframe is categorical (M or F) so we need to remove it

    • heart_df[,-2] is selecting the whole of the heart_df dataframe except column 2

  • heart_corrplot_matrix <- cor(heart_df_only_numerical)

    • This uses the cor() function to calculate all-against-all correlations
  • corrplot(heart_corrplot_matrix)

    • This takes our all-against-all correlations and plots a correlogram

Try:

corrplot also has different settings you can use.

To plot the strength of the correlation (R value) numerically, try: corrplot(heart_corrplot_matrix, method = "number")

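And if you only want to show the correlations that are statistically significant at the 5% level, first calculate a p value for every pair of columns with cor.mtest() (also from the corrplot package), then pass those p values to corrplot. Setting insig = 'blank' leaves the non-significant correlations empty:

heart_p_values <- cor.mtest(heart_df_only_numerical)$p
corrplot(heart_corrplot_matrix, p.mat = heart_p_values, sig.level = 0.05, insig = 'blank')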

That’s the end of the tutorial - now move on to Worksheet 2.

Comic on correlation from xkcd
